A0-A1B9  M3 


UNCLASSIFIED 


ON  RELAXATION  ALGORITHMS  BASED  ON  HARKOV  RANDOM  FIELDS 
<U)  ROCHESTER  UNIV  NV  DEPT  OF  COMPUTER  SCIENCE 
P  B  CHOU  ET  AL.  IB  JUL  87  TR-212  NBBB14-82-K-B1B3 


AD-A189  043 


.«(  .«4  .u  .'tl.n.. 


.’vvwyw"*  ww*-*  wv  wu  w\i  >rw  w  u  » 


one  file  copy 


3 


On  Relaxation  Algorithms 
Based  on  Markov  Random  Fields* 

Paul  B.  Chou 
Rajeev  Raman 


Computer  Science  Department 

i" 

The  University  of  Rochester 

L. 

Rochester,  New  York  14627 

ff&EI 

TR  212 

July  10,  1987 

DTIG 

gj?^EL.ECTE| 

JAN  2  1 19881 


Department  of  Computer  Science 
University  of  Rochester 
Rochester,  New  York  14627 


88  5. 


0  84 


On  Relaxation  Algorithms 
Based  on  Markov  Random  Fields* 


Paul  B.  Chou 
Rajeev  Raman 

Computer  Science  Department 
The  University  of  Rochester 
Rochester,  New  York  14627 

TR  212 
July  10, 1987 


ABSTRACT 

Many  computer  vision  problems  can  be  formulated  as  computing  the  minimum  energy  states  of 
thermal  dynamic  systems.  However,  due  to  the  complexity  of  the  energy  functions,  the  solutions  to  the 
minimization  problem  are  very  difficult  to  acquire  in  practice.  Stochastic  and  deterministic  methods  exist  to 
approximate  the  solutions,  but  they  fail  to  be  both  efficient  and  robust  In  this  paper,  we  describe  a  new 
deterministic  method  -  the  Highest  Confidence  First**  algorithm  -  to  approximate  the  minimum  energy 
solution  to  the  image  labeling  problem  under  the  Maximum  A  Posteriori  (MAP)  criterion.  This  method 
uses  Markov  Random  Fields  to  model  spatial  prior  knowledge  of  images  and  likelihood  probabilities  to 
represent  external  observations  regarding  hypotheses  of  image  entities.  Following  an  order  decided  by  a 
dynamic  stability  measure,  the  image  entities  make  local  estimates  based  on  the  combined  knowledge  of 
priors  and  observations.  We  show  that,  in  practice,  the  solutions  so  constructed  compare  favorably  to  the 
ones  produced  by  existing  methods  and  that  the  computation  is  more  predictable  and  less  expensive. 

•This  work  was  supported  in  pan  by  the  Air  Force  Systems  Command,  Rome  Air  Development  Center,  Griffiss  Air  Force  Base,  New 
York  13441-5700,  and  the  Air  Force  Office  of  Scientific  Research,  Bolling  AFB,  DC  20332  under  Contract  No.  F30602-85-C-0008  . 
This  contract  supports  the  Northeast  Artificial  Intelligence  Consortium  (NAIC).  This  work  was  also  supported  in  part  by  U.S.Army 
Engineering  Topographic  Laboratories  research  contract  no.  DACA76-85-C-0001,vui  part  by  NSF  Coordinated  Experimental  Research 
grant  no.  DCR -8320136,  in  part  by  ONR/DARPA  research  contract  no.  N00014-82-K-0193,  and  in  part  by  a  grant  from  the  Eastman 
Kodak  Company.  We  thank  the  Xerox  Corporation  University  Grants  Program  for  providing  equipment  used  in  the  preparation  of  this 
paper. 
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1.  Introduction 

Probability  theory  has  found  many  applications  in  representing  the  uncertainty  of 
various  kinds  of  knowledge  and  in  reasoning  about  the  world  [Feldman  and  Yakimov- 
sky  1974]  [Duda,  Hart,  and  Nilsson  1976]  [Peleg  1980]  [Pearl  1985]  .  It  appeals  to  the  AI 
community  for  many  reasons,  among  which  are  that  it  provides  a  well-developed 
mathematical  theory  for  using  uncertainty  measures  in  making  decisions,  and  that  it  pro¬ 
vides  well-known  ways  of  incorporating  empirical  data.  However,  there  have  been  few  suc¬ 
cessful  attempts  at  utilizing  this  tool  in  practical  machine  vision  systems.  It  is  apparently 
not  because  this  domain  is  any  the  less  "uncertain"  but  because  the  complexity  of  the 
information  in  this  domain  hinders  the  advancement  of  probabilistic  approaches. 

It  is  our  goal  to  demonstrate  the  practicability  of  applying  the  Bayesian-probability 
formalism  to  complex  domains,  such  as  the  image  labeling  problem  discussed  in  this  paper. 
The  image  labeling  problem  is  to  assign  labels  to  image  entities  such  as  regions,  line  seg¬ 
ments,  and  pixels.  The  set  of  labels,  usually  reflecting  the  photometric  and  geometrical 
phenomena  of  the  scene,  is  mutually  exclusive  and  exhaustive  at  a  particular  level  of 
abstraction.  Denote  the  set  of  labels  { /  j  ,/2 ,  •  •  .  ,  lq  }  as  L,  and  the  set  of  the  image  entities 
{  *i,  *2>  •  •  -  >  *iv  }  as  S.  Any  mapping  from  S  to  L  is  a  feasible  solution  to  the  labeling 
problem.  To  choose  an  "optimal"  solution  from  the  set  of  feasible  solutions  -  Q,  image 
observations  as  well  as  prior  knowledge  about  spatial  relations  between  labels  are  used  to 
evaluate  the  goodness  of  each  solution.  This  work  follows  the  probabilistic  model  for  visual 
computation  proposed  in  [Chou  and  Brown  1987b],  Spatial  prior  knowledge  and  local  visual 
observations  are  separately  represented  in  terms  of  probabilities.  This  decoupling  provides 
a  clean  and  uniform  way  of  modeling  information  at  different  levels  of  abstraction,  and 
therefore  to  modularize  the  design  and  implementation  of  probabilistic  systems  Bayes' 
rule  is  used  to  combine  priors  and  observations  to  form  the  a  posteriori  probabilities 
representing  the  updated  knowledge.  The  labeling  problem  is  then  formulated  as  a  minimi¬ 
zation  problem  based  on  the  Bayesian  decision  rationale.  We  shall  show  that  by  using  a 
new  algorithm  proposed  here  to  estimate  the  optimal  solution  to  the  minimization  problem 
so  formulated,  it  is  possible  to  achieve  excellent  results  with  relatively  little  computation, 
given  a  set  of  reasonable  assumptions. 

The  organization  of  this  paper  is  as  follows.  We  discuss  bow  to  encode  the  a  priori 
knowledge  of  image  events  with  the  Markov  random  field  (MRF)  models  in  Section  2  The 
Bayesian  decision  rationale  is  discussed  in  Section  3.  In  Section  4,  we  review  several  sto¬ 
chastic  relaxation  methods  that,  in  principle,  could  find  the  optimal  solutions  given  enough 
computational  resources.  We  describe  a  new  deterministic  estimation  algorithm  in  Section 
5  that  our  experiments  indicate  to  be  superior  to  existing  methods.  Experiments  on  edge 
detection  -  an  instance  of  the  labeling  problem  -  with  both  synthetic  and  natural  images 
are  conducted  with  an  MRF  simulation/estimation  package  implemented  by  the  authors 
Results  from  various  estimation  schemes  are  compared  in  Section  6. 

2.  Markov  Random  Fields  and  Gibbs  Distributions 

Markov  Random  Fields  have  been  used  for  image  modeling  in  many  applications  for 
the  past  few  years  [Hassner  and  Slansky  1980]  [Cross  and  Jain  1983]  [Marroquin,  Mitter. 
and  Poggio  1985]  [Geman  and  Geman  1984]  [Derin  and  Cole  1986]  [Chou  and 
Brown  1987a].  In  this  section,  we  review  the  properties  of  MRF's  and  discuss  how  to 


encode  prior  knowledge  in  this  formalism.  We  refer  the  reader  to  [Kindermann  and 
Snell  1980]  for  an  extensive  treatment  of  MRF’s. 

2.1.  Noncausal  Markovian  Dependency 

Let  X  =  {X,,  s€S}  denote  a  set  of  random  variables  indexed  by  S.  Without  loss  of 
generality,  assume  all  variables  in  X  have  a  common  state  space  L,  so  that  X,tL.  Let 

{X  =  u}  be  the  event  {XM  =«,,,  .  .  .  ,X,N  =  u,N}  ,  where  u  =  . u,N),  w,€L,  is  a 

configuration  of  X.  Since  a  configuration  of  X  is  also  a  feasible  solution  to  the  labeling  prob¬ 
lem,  0  also  denotes  the  set  of  all  possible  configurations. 

Let  £  be  a  set  of  unordered  pairs  (s,,  SjY s  representing  the  "connections"  between  the 
elements  in  S.  The  semantics  of  the  connections  will  become  clear  shortly.  E  defines  a 
neighborhood  system  T  =  {N,  |s€S},  where  N,  is  the  neighborhood  of  s  in  the  sense  that 

(1)  stN„  and 

(2)  r£N,  if  and  only  if  (s,  r)iE. 

X  is  a  Markov  Random  Field  with  respect  to  T  and  P,  where  P  is  a  probability  func¬ 
tion,  if  and  only  if 

( positivity )  P(X  =  w)  >  0  for  all  w€Q  (2.1) 

( Markovianity )  P(X,  =  u,  | Xr  =  ur,  r€S,  r  *s)  =  P(X,  =  u,  \Xr  =  ur,  r£N,)  (2.2) 

The  set  of  conditional  probabilities  on  the  left-hand  side  of  (2.2)  is  called  the  local  charac¬ 
teristics  that  characterizes  the  random  field.  It  can  be  shown  that  the  joint  probability  dis¬ 
tribution  P(X  =  u)  of  any  random  field  satisfying  (2.1)  is  uniquely  determined  by  these  con¬ 
ditional  probabilities  [Besag  1974].  An  intuitive  interpretation  of  (2.2)  is  that  the  contex¬ 
tual  information  provided  by  S  —  s  to  s  is  the  same  as  the  information  provided  by  the 
neighbors  of  s.  Thus  the  effects  of  members  of  the  field  upon  each  other  is  limited  to  local 
interaction  as  defined  by  the  neighborhood.  Notice  that  any  random  field  satisfying  (2.1)  is 
an  MRF  if  the  neighborhoods  are  large  enough  to  encompass  all  the  dependencies. 

2.2.  Encoding  Prior  Knowledge  and  Gibbs  Distributions 

The  utility  of  the  MRF  concept  for  image  labeling  problems  is  that  the  prior 
knowledge  about  spatial  dependencies  among  the  image  entities  can  be  adequately  modeled 
with  neighborhoods  that  are  small  enough  for  practical  purposes.  Very  often,  the  image 
entities  are  regularly  structured  and  prior  distributions  on  the  image  are  homogeneous 
and  isotropic.  In  such  cases,  the  number  of  parameters  needed  to  specify  the  priors  is  just  a 
fraction  of  QM,  where  M  is  the  size  of  the  neighborhoods.  This  is  a  significant  saving  over 
QN  -  the  number  of  possible  configurations,  especially  when  M  is  small. 

There  are  difficulties,  as  stated  in  [Geman  and  Geman  1984],  associated  with  using 
the  MRF  formulation  by  itself: 

(1)  The  joint  distribution  of  the  X,  is  not  apparent; 

(2)  It  is  extremely  difficult  to  spot  local  characteristics,  i.e.,  to  determine  when  a  given  set 
of  functions  are  conditional  probabilities  for  some  distribution  on  0. 
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(1)  is  not  a  serious  problem  for  some  special  classes  of  MRF  models  such  as  Markov 
Mesh  (MM)  processes  [Kanal  1980],  since  their  joint  distributions  can  be  represented  in  a 
recursive  formulation  due  to  the  casual  dependency  assumed.  For  (2),  parametric  probabil¬ 
ity  distributions  such  as  Gaussian  and  binomial,  have  been  been  used  in  the  literature 
[Cross  and  Jain  1983]  [Cohen  and  Cooper  1987].  Using  such  distributions  further 
simplifies  the  encoding  of  the  local  characteristics  and  has  shown  some  impressive  results 
on  modeling  and  generating  texture  patterns.  However,  whether  these  kinds  of 
simplifications  preserve  the  power  of  MRF’s  for  modeling  spatial  knowledge  remains  ques¬ 
tionable. 

Fortunately,  these  difficulties  vanished  when  the  following  property  of  MRF’s  was 
realized. 


Hammer sley -Clifford  Theorem:  A  random  field  X  is  an  MRF  with  respect  to  a  neighborhood 
system  T  if  and  only  if  there  exists  a  function  V  such  that 

-jVM 

P(u)  =  - — -  for  all 

z 

where  T  and  Z  are  constants  and 

U(u)  =  2Vt(“)-  (2  4) 

c(C 


C  denotes  the  set  of  totally  connected  subgraphs  (cliques)  with  respect  to  T.  Z  is  a  normal¬ 
izing  constant  and  is  called  the  partition  function. 

The  probability  distribution  defined  by  (2.3)  and  (2.4)  is  called  a  Gibbs  distribution 
with  respect  to  I\  The  class  of  Gibbs  distributions  has  been  extensively  applied  to  model 
physical  systems,  such  as  ferromagnets,  ideal  gases,  and  binary  alloys.  When  such  systems 
are  in  a  state  of  thermal  equilibrium  ,  the  fluctuations  of  their  configurations  follow  a 
Gibbs  distribution.  In  statistical  mechanics  terminology,  U  is  the  energy  function  of  a  sys¬ 
tem.  The  Ve  functions  represent  the  potentials  contributed  to  the  total  energy  from  the 
local  interactions  of  the  elements  of  clique  c.  T,  the  temperature  of  the  system,  controls  the 
"flatness"  of  the  distribution  of  the  configurations. 

Gibbs  distributions,  and  therefore  MRF’s,  possess  a  property  that  appears  to  be  desir¬ 
able  for  modeling  -  when  constrained  by  a  fixed  expected  value  of  some  sufficient  statistic  of 
the  random  field,  the  maximum  entropy  distribution  among  the  class  of  distributions  com¬ 
patible  with  the  constraint  is  a  Gibbs  distribution. 

The  MRF -Gibbs  equivalence  not  only  relates  the  local  conditional  probabilities  to  the 
global  joint  probabilities,  but  also  provides  us  a  conceptually  simpler  way  of  specifying 
MRF’s  -  specifying  potentials.  The  importance  of  the  joint  probabilities  will  become  evident 
in  the  next  section.  The  local  characteristics  can  be  computed  from  the  potential  function 
through  the  following  relation: 


2  v*i“> 


P(X,  =  u,\Xr  =  urir*s)  =  — - y 
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where  C,  is  the  set  of  cliques  that  contain  s,  and  u'  is  any  configuration  of  the  field  that 
agrees  with  u  everywhere  except  possibly  s. 

There  has  been  little  work  that  applies  statistical  estimation  methods  to  estimate 
parameters  used  for  specifying  MRFs.  [Cross  and  Jain  1983]  applies  a  coding  scheme  to 
estimate  the  parameters  in  their  binomial  distribution  models  using  a  maximum  likelihood 
criterion.  [Elliott  and  Derin  1984]  uses  a  least-square-fit  method  to  estimate  potential 
functions  in  the  Gibbs  distributions  of  their  texture  models.  These  methods  are  good  when 
many  uncorrupted  realizations  are  available.  When  such  data  are  difficult  to  acquire,  choos¬ 
ing  the  clique  potentials  on  an  ad  hoc  basis  has  been  reported  to  produce  promising  results 
[Geman  and  Geman  1984]  [Marroquin,  Mitter,  and  Poggio  1985].  Our  experiments  (Sec¬ 
tion  6)  have  also  shown  good  results.  These  results  are  not  surprising  since  the  notion  of 
clique  potentials  provides  a  simple  mapping  from  "qualitative"  spatial  knowledge  to 
numeric  values  of  the  parameters  specifying  the  MRF’s. 

3.  Bayesian  Decision  Rationale  and  Optimality  of  Solutions 

At  various  levels  of  a  visual  hierarchy,  estimations  (decisions)  must  be  made  based  on 
the  information  available.  The  estimation  procedures  become  complex  when  the  informa¬ 
tion  is  uncertain,  which  is  usually  the  case  in  visual  processing.  In  this  section,  we  exam¬ 
ine  the  Bayesian  decision  rationale  and  the  optimal  solutions  to  the  labeling  problem  with 
respect  to  this  rationale. 


3.1.  A  Posteriori  Probabilities 

Section  2  described  how  to  encode  prior  spatial  knowledge  using  the  MRF  formalism. 
Incorporating  the  image  observations,  Bayes’  rule  can  then  be  used  to  derive  the  a  pos¬ 
teriori  probabilities  on  Q  from  the  a  priori  model  of  the  image. 

(B.,es-R„l„  f<«|0).  ,3.1, 


O  denotes  the  image  observations.  The  likelihood  of  an  event  {X  =  u}  given  O,  P(0  |  w),  is 
usually  derived  from  the  image  degradation  model  involving  imaging  noise  and  blur 
[Geman  and  Geman  1984].  [Sher  1987]  and  [Bolles  1977]  show  methods  to  generate  likel¬ 
ihood  functions  from  either  probabilistic  models  or  statistical  data. 

For  a  shift-invariant  point-spread  function  and  white  Gaussian  noise,  the  a  posteriori 
distribution  associated  with  the  a  priori  distribution  defined  by  (2.3)  and  (2.4)  is  a  Gibbs 
distribution  with  respect  to  a  neighborhood  system  related  to  T  and  the  support  of  the 
point-spread  function  [Geman  and  Geman  1984].  For  simplicity,  we  assume  the  following 
conditional  independence,  that  is  generally  true  when  the  noise  field  is  independently  dis¬ 
tributed. 

P(0|u)  =  nP(O.I«.)  (3  2) 

«ts 

0,  denotes  a  set  of  image  observations  over  a  spatial  region  dependent  on  s,  typically 
including  s  and  its  spatially  adjacent  elements.  This  assumption  appears  to  be  very  useful 
for  fusing  and  modeling  early  visual  modules  [Chou  and  Brown  1987a]  and  for  texture 
modeling  [Derin  and  Cole  1986].  The  a  posteriori  MRF  thus  has  the  same  neighborhood 


(3.3) 


system  T  with  the  e  ergy  function 

U0(u)  =  2Vc(u)  -  T21nP(0,|tJ,) 

etc  «(S 


3.2.  Optimal  Labelings 

The  goodness  of  a  labeling  u,  following  the  Bayesian  formalism,  is  evaluated  in  terms 
of  its  a  posteriori  expected  loss, 

Loss(w|0)  =  2  (3  4) 

where  loss  (u,u)  is  a  penalty  associated  with  the  estimate  <1>  while  the  "truth"  is  u. 

One  question  concerning  the  applicability  of  (3.4)  is  which  loss  function  should  be 
used  for  a  given  task.  Except  for  few  simple  cases,  the  answer  to  this  question  usually 
relies  on  subjective  judgements.  One  popular  choice  is  assigning  a  constant  penalty  to 
incorrect  estimates:  loss(u,u)  equals  to  a  constant  (positive)  value  whenever  u*u,  and  0 
otherwise.  Using  this  loss  function,  the  configuration  minimizing  (3.4)  maximizes  the  a  pos¬ 
teriori  probability  P(u\0),  and  therefore  minimizes  the  a  posteriori  energy  (3.3)  in  the  MRF 
formalism.  This  Maximum  A  Posteriori  (MAP)  criterion  has  been  widely  applied  to  the 
labeling  problem  [Feldman  and  Yakimovsky  1974]  [Geman  and  Geman  1984]  [Derin  and 
Cole  1986]  [Murray  and  Buxton  1987]  [Cohen  and  Cooper  1987].  Marroquin  et  al  [1985] 
suggest  that  the  number  of  mislabeled  image  entities  of  an  estimation  is  a  better  loss  meas¬ 
ure  for  the  labeling  problem.  They  derive  the  Maximizer  of  the  a  Posteriori  Marginals 
(MPM)  estimation  -  choosing  the  configuration  u  =  («,,,  •  •  • , u,s, )  such  that 

u,  =  max  P,{1\0)  Vs€S,  (3.5) 

where  P,(/|0)  denotes  the  a  posteriori  marginal  probability  of  l  on  s.  In  their  experiments 
the  MPM  estimator  is  shown  to  be  superior  to  the  MAP  criterion  when  the  signal  to  noise 
ratio  is  low. 

Notice  that  the  rationale  of  minimizing  the  loss  function  in  (3.4)  does  not  take  the  cost 
of  computation  into  account,  despite  the  fact  that  computational  cost  is  usually  a  primary 
consideration  in  image  understanding  applications  because  of  their  immense  configuration 
spaces.  A  sub-optimal  estimator  with  an  effective  computation  procedure  would  be  much 
more  useful  than  an  optimal  estimator  that  no  one  could  ever  compute.  It  is  believed  that 
the  exact  evaluation  of  MRF  statistical  moments,  and  therefore  (3.4),  is  generally  impossi¬ 
ble  since  no  analytic  solutions  exist  [Hassner  and  Slansky  1980]  [Geman  and 
Geman  1984].  MAP  and  MPM  can  not  be  exactly  determined  for  the  same  reason,  except 
for  some  simple  energy  functions.  In  the  rest  of  the  paper,  we  discuss  several  numerical 
approaches  for  the  approximate  evaluations  of  the  MAP  and  MPM  estimations  in  the  MRF 
formalism. 

4.  Stochastic  Relaxation  Methods 

One  method  that  has  been  successfully  used  to  analyze  the  behavior  of  complex  sys¬ 
tems  is  generating  sample  configurations  of  a  given  system  through  stochastic  simulations. 
Briefly,  the  Monte  Carlo  method  of  estimating  the  ensemble  average  of  a  variable  Y  <  u\ 


<Y  >  =  Jy  (u)dP(u), 

Q 

is  averaging  its  values  over  a  set  of  samples  {  <•>!,  •  ■  •  ,uR  }  drawn  from  G.  If  the  sampling 
of  u’ s  follows  the  distribution  P,  then  <  Y"  >  can  be  approximated  by 

<Y>  -  ±2Y{ar). 

K  r»l 

We  are  interested  in  sampling  procedures  that  generate  configurations  according  to 
Gibbs  distributions  in  the  form  of  (2.3).  With  such  procedures,  the  sample  frequencies  of  the 
realizations  of  X,  can  be  used  as  approximations  for  the  marginal  probabilities,  i.e.,  MPM 
can  be  estimated;  the  configurations  with  higher  probabilities  are  more  likely  to  be  sam¬ 
pled,  and  therefore  MAP  estimation  becomes  possible  (see  Section  4.2).  Several  procedures 
exist  for  this  purpose.  The  basic  idea  of  these  procedures  is  to  construct  a  regular  Markov 
chain  whose  states  correspond  to  the  configurations  of  the  system  with  the  limiting  distri¬ 
bution  being  the  desired  Gibbs  distribution.  That  is,  construct  Pc  -  the  transition  matrix  of 
the  chain  -  in  such  a  way  that  the  following  condition  holds. 

nPc  =  it,  (4.1) 

where  it  is  the  desired  Gibbs  measure.  At  equilibrium,  the  system’s  configurations  are  dis¬ 
tributed  according  to  it  since  it  is  the  unique  invariant  measure  of  the  constructed  Markov 
chain  [Kemeny  and  Snell  1960]  . 

Consider  each  state  transition  of  the  Markov  chain  involving  only  the  change  of  the 
state  of  a  single  entity  in  the  system.  To  fulfill  the  requirement  of  the  chain  being  regular, 
the  procedure  must  continue  to  "visit"  every  entity.  Let  s(t)  be  the  entity  being  visited  at 
time  t.  The  change  of  X,(()  would  result  a  change  of  the  system  energy  by  the  amount 
specified  by  the  configurations  of  those  cliques  that  contain  aU)  according  to  (2.4).  Stochas¬ 
tic  sampling  procedures  reminiscent  of  "relaxation"  can  be  designed  in  the  sense  that  the 
state  transition  of  the  entity  being  visited  is  stochastically  decided  by  the  states  of  the 
neighboring  entities  and  itself.  We  will  describe  two  of  the  stochastic  relaxation  pro¬ 
cedures,  namely  the  Metropolis  algorithm  [Metropolis  et  al.  1953]  and  the  Gibbs  sampler 
[Geman  and  Geman  1984],  for  their  representativeness.  Other  variations  basically  follow 
the  same  principle  and  serve  special  purposes  [Hassner  and  Slansky  1980]  [Cross  and 
Jain  1983]  [Hinton  and  Sejnowski  1983]  . 

4.1.  The  Metropolis  Algorithm  and  the  Gibbs  Sampler 

Let  XU)  denotes  the  state  of  the  system  at  time  step  t.  The  state  transition  from  step 
t  to  f +1  of  the  Markov  chain  generated  by  the  Metropolis  sampling  algorithm  consists  of 
two  basic  steps: 

(1)  Randomly  select  a  new  configuration  u'  (  randomly  visit  an  entity  s  and  choose  a  new 
state  «',),  and  compute  the  energy  change  A E  =  Eiu')  -  E(X(t))  . 

(2)  If  A£<0  ,  set  XU+1)  =  Otherwise,  set  XU  +  1)  to  «'  or  XU)  with  probabilities 

=  «_A£/r  and  1  -  t  ^/T  respectively. 

Allowing  transitions  with  energy  increases,  a  common  characteristic  of  all  stochastic 
relaxation  procedures,  prevents  the  sampling  process  from  getting  stuck  at  states  of  local 


energy  minimum  -  an  undesirable  property  of  every  deterministic  hill-climbing  procedure 
In  contrast  to  the  explicit  use  of  the  energy  difference  in  the  Metropolis  algorithm,  the 
Gibbs  sampler  uses  the  local  characteristics  to  construct  a  Markov  chain.  A  state  transition 
of  the  Gibbs  sampler  also  consists  two  steps: 

(1)  Visit  an  entity  s. 

(2)  Randomly  select  the  new  state  u',  for  Xg(£+1)  following  the  distribution 
v(X,(t  +1)  =  «, \Xr(t),r *s).  Having  the  form  in  (2.5),  this  distribution  is  generally 
easy  to  compute. 

For  binary  systems,  the  Gibbs  sampler  is  equivalent  to  the  widely  used  "Heat  Bath” 
algorithm  -  changing  the  state  with  probability  —  .  Like  other  relaxation  methods. 

the  above  procedures  suggest  the  use  of  a  parallel  implementation  since  "updating"  the  Xs'& 
requires  propagating  information  only  among  neighboring  computing  units.  Extra  caution 
must  be  paid  to  the  updating  patterns  of  synchronous  machines.  For  the  Metropolis  and 
Heat  Bath  algorithms,  using  any  prescribed  updating  order  may  result  in  the  Markov  chain 
not  converging  to  the  desired  Gibbs  distribution  it  [Marroquin  1985],  Our  experiments  use 
the  Gibbs  sampler  exclusively  because  it  guarantees  the  coincidence  of  it  with  the  invari¬ 
ant  measure  of  the  chain  as  long  as  neighboring  entities  are  not  updated  simultaneously 

4.2.  The  Monte  Carlo  and  Simulated  Annealing  Methods 

The  stochastic  relaxation  scheme  can  be  used  to  approximate  the  a  posteriori  margi¬ 
nal  probabilities  for  the  MPM  estimation  by  simulating  the  equilibrium  behavior  of  the  a 
posteriori  MRF.  Since  the  Markov  chain  constructed  by  either  the  Metropolis  algorithm  or 
the  Gibbs  sampler  leads  to  the  desired  limiting  distribution  regardless  of  its  initial  state, 
the  law  of  large  numbers  suggests  the  marginal  probability  P,(l\0)  be  approximated  by  the 
sample  frequency  of  X,  =  1  at  equilibrium,  that  is, 

P,(/|0)  *  — [—  26(X,<t)-l)  (4-2» 

n  *  t  =* 

where  5(0)  =  1,  and  0  elsewhere,  k  is  the  number  of  steps  for  the  chain  to  reach  equili 
brium,  and  n  is  the  total  number  of  steps  of  the  simulation.  Practically,  experimentation  is 
needed  to  determine  how  large  n  and  k  should  be  to  achieve  a  desirable  approximation 
accuracy  given  an  arbitrary  MRF.  Cross  and  Jain  [1983]  have  observed  that  in  less  than 
10  iterations  (full  sweeps  over  the  image  entities),  their  texture  modeling  system  becomes 
"stable"  when  sampled  by  a  variation  of  the  Metropolis  algorithm.  In  general,  in  the  order 
of  hundreds  of  iterations  are  needed  for  the  MPM  estimation 

The  system  temperature  •  T  in  (2.3)  -  also  plays  an  important  role  in  MRF  simula¬ 
tions.  With  low  temperatures,  the  Gibbs  distribution  strongly  favors  the  low  energy 
configurations,  but  the  time  required  for  the  system  to  reach  equilibrium  may  be  long.  The 
system  may  reach  equilibrium  faster  at  higher  temperatures,  but  the  configurations  are 
more  evenly  sampled;  i.e.,  it  may  require  more  samples  to  make  accurate  MPM  estimations 
The  idea  of  simulated  annealing  [Kirkpatrick,  Gelatt,  and  Vecchi  1983],  obviously  inspired 
by  physical  annealing,  is  to  reach  the  minimum  energy  states  of  a  system  by  starting  the 
system  at  a  high  temperature  and  gradually  reducing  it  In  doing  so  the  system  tends  to 
respond  to  large  energy  differences  at  the  beginning,  and  is  likely  to  find  a  good  minimum 


energy  state  independent  of  its  starting  state.  As  the  temperature  decreases,  the  system 
tends  to  respond  to  small  energy  differences,  and  ideally  settles  at  the  lowest  energy  states 
ever  encountered  The  decreasing  sequence  of  temperatures,  called  the  annealing  schedule, 
decides  the  effectiveness  of  this  process.  If  the  time  spent  at  each  temperature  is  not 
enough,  the  system  may  not  converge  to  the  global  minimum  states.  On  the  other  hand,  it 
is  often  computationally  prohibitive  to  use  a  slowly  decreasing  schedule.  Geman  and 
Geman  [84]  have  derived  an  upper  bound  for  the  annealing  schedules  so  that  the  schedules 
slower  than  this  bound  are  guaranteed  to  converge  to  the  global  minimum  energy  states 
However,  this  bound  is  very  difficult  to  decide  in  practice  since  it  relates  to  the  range  of 
energy  values  of  the  system 

Simulated  annealing  has  been  applied  in  many  computer  vision  tasks  that  involve 
optimization  over  exponential  spaces,  including  the  MAP  estimation  [Geman  and 
Geman  1984]  and  the  stereo  matching  problem  [Barnard  1987].  One  major  concern  of 
using  the  stochastic  relaxation  scheme  is  its  efficiency:  at  what  cost  can  this  scheme  deliver 
satisfactory  results'’  Not  surprisingly,  the  cost  is  intolerable  for  many  applications.  In  the 
next  section,  we  describe  a  new  deterministic  method  to  approximate  MAP.  This  method, 
following  a  search  path  suggested  by  the  visual  observations  to  find  a  minimum  energy 
state,  appears  to  give  results  favorably  comparable  in  practice  to  the  existing  relaxation 
methods  while  being  computationally  less  expensive. 

5.  Deterministic  Relaxation  Methods 

Exact  calculation  of  the  MAP  estimate  is  computationally  prohibitive.  For  vision  sys¬ 
tems  that  require  predictable  results  in  reasonable  time  periods,  using  suboptimal  estima¬ 
tion  criteria  and  or  heuristics  in  searching  for  solutions  seems  to  be  a  reasonable  alterna¬ 
tive  to  the  stochastic  relaxation  scheme.  In  [Derin  and  Cole  1986],  MAP  estimations  are 
performed  on  narrow  strips  of  the  image.  The  strips  are  limited  to  at  most  four  rows  wide 
so  that  MAP  can  be  exact  computed  for  each  strip  by  a  dynamic  programming  algorithm  at 
feasible  cost.  For  each  estimation,  only  the  estimate  of  the  first  row  of  a  strip  is  kept  It 
serves  as  the  boundary  condition  for  the  next  strip  consisting  of  the  rest  of  the  rows  and  a 
new  one  Though  limiting  the  extent  of  the  (column  wise)  interactions,  the  texture  segmen 
tation  results  appear  to  be  impressive.  Before  we  describe  the  proposed  heuristic-based 
algorithm,  we  examine  an  iterative  relaxation  method  for  estimating  MAP. 

5.1.  Iterative  Energy  Minimization 

A  simple  version  of  deterministic  iterative  relaxation  methods  for  energy  minimiza¬ 
tion  is  the  Metropolis  algorithm  without  randomness:  Start  with  an  initial  configuration 
At  each  iteration  through  the  image  entities,  the  state  of  each  entity  is  either  changed  to 
the  state  that  yields  maximal  decrease  of  the  energy,  or  is  left  unchanged  if  no  energy 
reduction  is  possible  The  process  stops  when  no  more  changes  can  be  made.  This  algo¬ 
rithm  is  guaranteed  to  find  a  local  minimum  of  the  energy  function  since  each  iteration 
strictly  decreases  the  energy  value  and  there  are  only  a  finite  number  of  different  values  of 
the  energy  function.  For  parallel  implementation,  convergence  is  assured  if  the  neighbor 
ing  entities  are  not  updated  simultaneously 

Unavoidably,  the  local  minimum  obtained  by  the  above  algorithm  may  be  far  from 
optimal  Two  enhancements  are  apparently  helpful: 


(1)  Start  with  a  better  initialization  of  the  MRF.  The  best  one  can  hope  is  that  the  energy 
▼alue  of  the  initial  configuration  falls  into  the  valley  of  the  global  minimum.  One  pos¬ 
sibility  is  to  use  the  maximum  likelihood  estimates  (MLE)  -  X,(0 )  =  u,  if 
max  P(0,|Z)=P(0,  |«,). 

(2)  Escape  from  shallow  valleys.  By  changing  the  states  of  more  than  one  entity  at  once, 
the  new  configuration  may  lead  to  a  better  local  minimum.  In  a  procedure  described 
in  [Cohen  and  Cooper  1987],  the  entities  with  small  preferences  of  the  current  states 
over  the  others  are  assigned  new  states  when  a  local  minimum  is  reached.  The  relaxa¬ 
tion  restarts  with  the  new  configuration  as  the  initialization.  At  each  convergence,  the 
magnitude  of  the  local  minimum  is  estimated,  The  procedure  halting  when  no 
significant  change  of  the  magnitudes  is  observed.  The  hope  is  that  the  deepest  valley 
will  be  found  in  this  process. 

Unfortunately,  these  two  modifications  are  not  adequate.  The  local  MLE’s  are  good 
only  when  the  noise  process  is  correctly  modeled  in  computing  the  likelihoods  and  there  are 
significant  differences  among  the  likelihoods  of  the  hypotheses  Frequently  these  conditions 
cannot  be  met.  Cohen  and  Cooper’s  procedure,  obviously  a  compromise  between  stochastic 
and  deterministic  relaxation  methods,  suggests  a  tradeoff  between  speed  and  performance. 

The  algorithm  of  this  paper  blends  the  initialization  into  the  estimation  process. 
Instead  of  stepping  through  the  configuration  space  12,  this  algorithm  constructs  a 
configuration  with  a  local  minimal  energy  measure.  Observable  evidence  and  spatial  prior 
knowledge  are  combined  in  the  process  of  the  construction,  resulting  in  better  results  and 
efficiency.  The  details  of  this  algorithm  are  described  next. 

5.2.  The  Highest  Confidence  First  Algorithm 

To  see  how  this  algorithm  works,  some  terminology  needs  to  be  introduced.  Let 
L  =  LU{/0}  denote  the  augmented  label  set,  where  L  =  {/ 1 ,  *  *  '  ,Iq}  is  the  set  of  labels,  and 
l0  is  the  null  label  corresponding  to  the  "uncommitted"  state  in  the  construction.  Let  S2  = 
{«  =  («!,  •  •  •  ,uN)\u,lL,VstS}  denote  the  augmented  configuration  space.  Define  the  aug¬ 
mented  a  posteriori  local  energy  of  HL  with  respect  to  siS  and  a  configuration  u£  12  as 

£,(/)=  2  v'c(u')  -  rinP(0,|2),  (5ill 

c:  tic 

where  «'€ 0  is  the  configuration  that  agrees  with  u  everywhere  except  u',  =  l,  and  V'c  is 
0  if  ur  =  l0  for  any  r  in  c,  otherwise  it  is  equal  to  Vc  •  the  potential  function. 

The  basic  idea  of  this  algorithm  is  to  construct  a  sequence  of  configurations 
uPtu>x ,  •  •  •  with  the  starting  configuration  w°  =  (/<>.  *  •  •  ,lo),  and  a  terminal  configuration 
fa/€l2,  where  Uo(uf)  is  a  local  minimum  with  respect  to  Q.  We  Bay  an  entity  s  has  made  its 
current  decision  l,  HL,  if  the  just-constructed  (current)  configuration  u  in  the  sequence  has 
the  component  u,  =  l,  and  it  has  not  made  a  decision  if  u,  =  l0.  Once  an  entity  makes  a 
decision,  it  can  change  this  decision  to  other  labels  of  L  but  not  to  lo-  To  ensure  the  quality 
of  the  resulting  estimate  -  uf ,  at  each  step  of  the  construction  we  permit  only  the  least 
"stable"  entity  to  change/make  its  decision.  We  define  the  stability  of  s  with  respect  to  the 
current  configuration  u,  u,  =  l,  as 
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G,(«)  =  ^min^  A£,(k,/)  if  UL  (5.2a) 

G,(«)  =  -  min  A E,(k,j)  if  l  =  Iq  and  jiL  s.t.  E,(J )  =  min  E,(k),  (5.2b) 

h(L,k*j  k(L 

where  hE,(j,k)  =  Et(j )  —  E,(k )  with  respect  to  u. 

The  stability  defined  above  is  a  combined  measure  of  the  observable  evidence  and  the 
a  priori  knowledge  about  the  preferences  of  the  current  state  over  the  other  alternatives.  A 
negative  value  of  G,  that  is  always  true  for  (5.2b),  indicates  a  more  stable  configuration 
will  result  from  an  alternative  decision.  Since  an  entity  has  no  effect  on  its  neighbors 
unless  it  has  made  a  decision,  the  entities  with  large  likelihood  ratios  of  one  label  over  the 
others  -  strong  external  evidence  in  favor  of  a  label  -  will  be  visited  early  in  the  construc¬ 
tion  sequence.  The  entities  with  little  idea  from  the  observations  will  collect  information 
from  the  neighbors’  decisions  to  make  their  decisions;  an  early  decision  will  be  altered  if 
the  neighbors’  later  decisions  are  strongly  against  it.  In  this  way,  every  step  of  this  con¬ 
struction  makes  a  maximal  progress  based  on  the  current  knowledge  about  the  field  -  the 
G’ s.  This  Highest  Confidence  First  algorithm  is  expected  to  find  the  estimate  that  is  "con¬ 
sistent"  with  the  observations  and  the  a  priori  knowledge. 

The  Highest  Confidence  First  algorithm  can  be  implemented  serially  with  a  heap 
(priority  queue)  maintaining  the  visiting  order  of  the  construction  according  to  the  values 
of  G’s  in  such  a  way  that  the  top  of  the  heap  is  the  entity  with  the  smallest  G  value.  Updat¬ 
ing  the  top's  decision  will  cause  the  changes  of  its  neighbors’  G-values,  and  therefore  the 
structure  of  the  heap.  The  following  is  the  pseudo  code  for  the  Highest  Confidence  First 
algorithm: 


«  —  (to>  '  ‘  '  ,fo>; 

top  -  Create_Heap(w); 
while  (G, op  <  0)  { 
s  =  top; 

Change_State(w,); 
Update_G)G,  ); 
Adjust-Heap(s); 
foreach  (r  (  N,)  { 
Update_G)Gr); 
Adjust-Heap)  r); 

} 

} 

return)  w); 


Change-State) w,)  changes  the  current  state  u,  of  s  to  the  state  l  such  that 
AE,(l,u,)  =  min  (*,«,)  if  or  E,(l )  =  min E,(k)  if  «,=/„•  Upon  this  change 

kfL.hmu,  ktL 

taking  place,  the  stability  of  s  changes  to  positive.  Update.G  is  called  for  every  entity  that 
is  affected  by  this  change,  namely  the  neighbors  of  s  according  to  )5.1),  to  update  their  sta¬ 
bility  measures  with  respect  to  the  new  configuration  Adjust-Heap) r)  maintains  the  heap 
property  by  moving  r  up  or  down  according  to  its  updated  G-value. 

Several  desirable  properties  of  this  procedure  can  easily  be  verified: 
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(1)  Termination:  This  procedure  always  returns  in  finite  time.  To  see  this  property,  let  us 
consider  the  two  types  of  Change-State  -  making  and  changing  a  decision  - 
separately.  The  procedure  can  make  at  most  N  decisions,  one  for  each  entity,  since 
nullifying  decisions  is  impossible.  Let  D  =  ( Sd,S—Sd )  be  a  partition  of  S  such  that 
So  is  the  set  of  entities  that  have  made  decisions.  Let  Qo  = 
{<o€fi|co,€L  VsSSj),  and  u,  =  l<t  Vs€S-So}  .  Since,  by  (5.1)  and  (5.2a),  changing 
the  decision  of  siSp  strictly  decreases  the  function  t/j>  :  fin  -*  R, 

UD(u )  =  2V'e<«>  -  T  2  lnP(0,|<o,), 

e  t(So 

the  procedure  can  make  only  a  finite  number  of  changes  with  respect  to  a  fixed  parti¬ 
tion  D.  There  are  only  a  finite  number  of  partitions,  therefore  the  total  number  of 
decision  changes  is  finite. 

(2)  Feasibility:  The  returned  configuration  is  in  Q  -  the  space  of  feasible  solutions.  For  if 
otherwise,  there  exists  an  s  such  that  u,  =  la.  From  (5.2b),  G,  <  0.  This  violates  the 
heap  invariant  property  since  it  requires  G(op  2  0  to  exit  the  while  loop. 

(3)  Optimality:  The  returned  configuration  has  the  locally  minimal  energy  measure  with 
respect  to  0.  That  is,  changing  the  decision  of  any  single  entity  can  not  decrease  the 
a  posteriori  energy  measure  Uq.  As  above,  this  property  can  easily  be  derived  from 
(5.2a)  and  the  heap  properties. 

This  implementation  takes  O(N)  comparisons  to  create  the  heap  and  0(log(AD)  to 
maintain  the  heap  invariance  for  every  visit  to  an  entity,  provided  the  neighborhood  size  is 
small  relative  to  N.  The  overheads  of  heap  maintenance  are  well  repaid  since  the  procedure 
makes  progress  for  every  visit,  in  contrast  to  the  iterative  relaxation  procedure  (Section 
5.1)  that  may  make  only  few  changes  per  iteration  ( N  visits).  Our  experiments  show  that 
on  the  average,  less  than  one  percent  of  the  entities  are  visited  more  than  once  using  the 
proposed  algorithm  while  the  deterministic  relaxation  procedure  takes  around  10  iterations 
to  reach  a  local  minimum.  This  advantage  becomes  more  evident  as  the  number  of  entities 
gets  larger.  Experimental  results  are  strongly  in  favor  of  the  proposed  algorithm  for  both 
efficiency  and  correctness.  They  are  discussed  in  Section  6 

5.3.  Possible  Extensions 

Since  the  order  of  the  deterministic  decisions  of  the  entities  in  a  cooperative  network 
is  crucial  to  the  final  mutual  agreement,  the  proposed  algorithm  assumes  that  good  results 
can  be  obtained  by  delaying  the  decisions  of  those  entities  who  have  little  idea  about  what 
to  do  until  they  get  enough  help  from  their  neighbors.  This  heuristic  can  be  used  along 
with  other  computational  methods  to  achieve,  perhaps,  better  results. 

Let  us  look  more  closely  at  the  process  of  achieving  a  consensus  using  this  heuristic. 
At  each  stage,  Sq  consists  of  a  set  of  isolated  clusters.  A  cluster  is  a  set  of  spatially  con¬ 
nected  (with  respect  to  O  entities.  We  say  two  clusters  are  isolated  from  each  other  if  none 
of  the  entities  of  a  cluster  is  a  neighbor  of  any  entity  of  the  other  cluster.  Each  cluster 
corresponds  to  an  MRF  with  free  boundaries  in  our  formalism.  When  an  entity  makes  a 
decision,  a  cluster  is  created  or  expanded,  or  clusters  are  merged.  When  an  entity  changes  a 
decision,  the  energy  of  the  corresponding  MRF  is  reduced  Eventually,  all  the  clusters  are 
merged  and  the  final  agreement  corresponds  to  a  local  minimum  configuration  of  the 


corresponding  MRF. 

The  notion  of  growing  clusters  suggests  a  natural  partition  of  the  image.  At  any 
instance,  the  entities  belonging  to  the  same  cluster  are  tightly  related,  but  they  are 
independent  of  the  members  of  other  clusters.  The  addition  of  a  new  member  to  a  cluster 
may  change  the  decisions  of  the  old  members,  but  the  changes  are  expected  to  be  small  due 
to  the  way  the  clusters  are  constructed.  Therefore,  it  makes  sense  to  compute  the  MAP  esti¬ 
mates  exactly  for  small  clusters  early  in  the  process.  We  believe  that  by  doing  so  the 
results  would  be  better  than  the  results  using  the  horizontal  strip  partition  as  in  [Derin 
and  Cole  1986]. 

The  process  of  growing  clusters  is  similar  to  annealing  in  the  sense  that  it  responds  to 
large  energy  differences  earlier  than  small  ones.  Nondeterminism  can  be  introduced  to 
those  entities  that  stay  "unstable"  -  the  entities  on  or  exterior  to  the  border  of  the  clusters  - 
late  in  the  process,  since  more  spatial  information  is  required  for  them  to  reach  a  globally 
satisfactory  agreement. 

The  Highest  Confidence  First  algorithm  can  be  implemented  with  a  set  of  cooperative 
computing  units.  Consider  a  winner-take-all  network  where  each  unit  corresponds  to  an 
entity  of  the  image  [Feldman  and  Ballard  1981].  Only  the  units  with  the  smallest  stability 
measures  can  "fire"  at  one  instant;  each  unit  maintains  the  knowledge  about  the  neighbor¬ 
ing  units  so  that  its  stability  measure  can  be  updated  immediately  should  any  neighbor 
change  its  state.  The  parallelism  gained,  however,  is  limited  due  to  the  sequential  firing 
order. 

6.  Experiments  and  Results 

We  have  chosen  to  tackle  the  well  studied  problem  of  edge  detection  using  MRFs  as 
the  underlying  formalism.  The  labeling  problem  in  this  context  is  to  assign  to  each  edge 
element  a  label  from  the  set  {EDGE,  NON-EDGE}.  Each  of  these  edge  elements  is  modeled 
as  an  MRF  entity.  The  MRF  entities  are  considered  to  be  situated  on  the  boundary  between 
two  pixels  (see  Fig.  1).  The  MRF  model  used  is  similar  to  the  "Line  Process"  MRF  used 
both  by  Geman  et  al  [1984]  and  MaiToquin  et  al  [1985].  Hence  the  MRF  is  binary,  with 
2{N2  - N )  entities  where  the  image  is  a  NxN  rectangular  pixel  array. 

6.1.  Construction  of  Potential  Functions  to  Encode  Prior  Knowledge 
The  spatial  relationships  between  entities  we  wish  to  enforce  include: 

(1)  To  encourage  the  growth  of  continuous  line  segments, 

(2)  To  discourage  abrupt  breaks  in  line  segments, 

(3)  To  discourage  close  parallel  lines  (competitions)  and 

(4)  To  discourage  sharp  turns  in  line  segments. 

A  second  order  neighborhood  turns  out  to  be  sufficient  to  enforce  all  the  relationships 
we  want.  In  this  neighborhood  system,  each  MRF  element  is  adjacent  to  eight  others  (see 
Fig  s  1  and  2). 

The  second  order  neighborhood  has  cliques  of  sizes  1  through  4  (see  Fig.  3).  The  poten¬ 
tial  values  we  assign  to  various  configurations  of  these  cliques  are  shown  in  Fig.  4.  These 
values  form  the  specification  of  the  potential  functions.  Therefore  potential  functions  can  be 


seen  to  be  specified  by  about  10  parameters,  which  are  currently  assigned  in  an  ad  hoc 
manner.  The  rules  of  thumb  that  are  used  to  assign  values  to  these  parameters  are: 

Determine  Structure  Enforcers  For  each  clique,  attempt  to  determine  what  kind  of 
structural  relation  it  is  uniquely  capable  of  enforcing. 

Encode  Prior  Structural  Knowledge  By  assigning  "high"  potential  values  to  undesir¬ 
able  configurations  of  the  cliques  and  "low"  values  to  desirable  ones,  we  attempt  to  ensure 
that  the  final  estimate  will  contain  as  few  of  the  undesirable  ones  as  possible. 

Encode  Statistical  Prior  Knowledge  We  use  the  clique  consisting  of  the  singleton  node 
to  bring  the  first  order  statistics  (eg.  the  density  of  EDGEs)  of  the  MRF  into  line  with 
whet  we  already  know.  The  potential  of  the  clique  when  the  MRF  entity  is  an  EDGE  is  set 
to  our  estimate  of  the  log  of  the  (local)  odds  of  an  entity  being  an  EDGE  over  a  NON¬ 
EDGE,  and  is  set  to  0  when  it  is  a  NON-EDGE. 

A  point  to  be  noted  is  that  some  of  these  parameter  values  are  interdependent.  For 
example,  increasing  the  energy  for  "break"  (Fig.  4b)  and  "continuation"  (Fig.  4c) 
configurations  simultaneously  would  be  of  little  use,  as  the  increases  would  tend  to  cancel 
each  other  out. 

The  sensitivity  of  the  results  obtained  to  changes  in  the  parameters  specifying  the 
potential  functions  depends  upon  the  parameter  in  question.  Our  experience  is  that  chang¬ 
ing  the  potential  function  associated  with  the  1-clique  had  the  greatest  effect  on  the  final 
result,  followed  by  the  2-clique  and  4-clique  potential  functions,  in  that  order.  This  could  be 
because  the  singleton  clique  controls  first  order  statistics  and  the  larger  cliques  higher 
order  statistics,  which  are  known  to  be  less  important  in  distinguishing  images 
[Julesz  1981]. 

6.2.  Likelihood  Generation 

We  adopt  a  step-edge  with  white  Gaussian  noise  model  to  compute  the  local  likeli¬ 
hoods  of  an  entity  s  being  EDGE  or  NON-EDGE  -  P(0,  \  w,  =  EDGE)  and 
P(0,  | u,  =NON-EDGE).  The  observation  -  0,  -  is  a  1X4  or  4X1  window  of  brightness 
observations  surrounding  s.  This  window  of  intensity  values  is  assumed  to  be  a  realization 
of  one  of  the  possible  events  depicted  in  Figure  5,  corrupted  by  independent  Gaussian  noise 
The  reader  is  referred  to  [Sher  1987]  for  details  of  probabilistic  edge  detection. 

From  (3  2),  observe  that  scaling  P(0,|f)  for  every  HL  by  a  constant  factor  for  fixed  s 
does  not  change  the  a  posteriori  distribution.  This  fact  allows  us  to  use  the  likelihood 
ratios  -  P<^l,  .  ag  the  only  input  data,  thus  simplifying  the  computation  of 

the  stability  measures  (5.2).  Thresholding  the  likelihood  ratios  by  the  prior  (local)  odds  of 
an  entity  being  an  EDGE  results  in  the  thresholded  likelihood  ratio  (TLR)  configuration 
that  can  be  considered  as  an  MAP  estimate  obtained  without  using  contextual  information. 
In  our  experiments,  we  use  TLR  as  the  initial  configuration  whenever  possible. 

6.3.  A  General  Purpose  MRF  Simulator 

Our  experiments  use  an  interactive  general-purpose  MRF  simulator  package  with 
extensive  graphics  and  menu-driven  control  (Fig.  6).  This  package  takes  the  description  of 
the  MRF  and  the  likelihood  ratios  as  input  and  simulates  the  state  transitions  of  the 
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entities  comprising  the  the  MRF.  The  user  can  specify  the  estimation  algorithm  to  be  used 
and  also  the  initialization  of  the  MRF  •  each  entity  can  be  initially  set  to  either  a  NON¬ 
EDGE  or  to  its  TLR  state.  The  input  MRF  is  constrained  to  be  a  homogeneous  one,  so  as  to 
make  the  time  and  space  needed  to  run  simulations  reasonable. 

The  user  provides  the  description  of  the  MRF  to  the  simulator  in  a  file.  This  file  con¬ 
tains  a  specification  of  the  nodes  comprising  each  clique,  aB  well  as  the  potential  function 
associated  with  it.  The  user  specifies  all  cliques  that  a  node  could  belong  to  in  the  most 
general  case,  including  all  instances  of  a  particular  clique  type  that  contain  the  node,  (i.e., 
even  if  all  the  cliques  containing  a  node  are  instances  of  the  same  clique  type,  the  user 
specifies  each  instance  separately).  The  nodes  forming  a  clique  are  specified  by  their  coor¬ 
dinates  relative  to  the  node  of  interest,  which  is  defined  to  be  at  relative  coordinates  (0,  O' 
Boundary  conditions,  as  in  the  case  of  nodes  near  the  border  of  the  MRF,  are  taken  carp  or 
by  the  simulator.  The  potential  function  is  specified  as  a  function  that  takes  as  input  , 
configuration  vector  (a  vector  of  states  of  nodes  of  the  MRF)  and  returns  a  potential  va.ue. 
The  potential  function  is  associated  with  the  clique  description,  and  the  ordering  of  the 
node  states  in  the  configuration  vector  passed  it  is  the  same  as  the  order  the  nodes  are 
specified  in  the  description  of  the  clique  itself. 

The  simulator  performs  certain  preprocessing  actions  on  the  description  of  the  MRF 
provided  by  the  user,  to  promote  run-time  efficiency.  The  first  is  to  store  each  potential 
function  as  a  table  indexed  into  by  a  configuration  vector.  This  is  done  so  as  to  avoid  run¬ 
time  calling  of  the  user’s  potential  function  code  ,  which  can  be  quite  complex,  replacing  it 
instead  with  a  simple  table  lookup.  The  other  is  "clique  containment",  which  is  based  on 
the  observation  that  if  one  clique  completely  contains  the  other,  then  a  configuration  vector 
of  the  nodes  in  the  larger  clique  contains  implicitly  the  configuration  vector  for  the  smaller 
clique.  This  suggests  that  by  judiciously  "adding”  together  the  potential  functions  for  the 
clique  in  the  preprocessing  stage,  we  can  avoid  run-time  evaluation  of  the  potential  func¬ 
tion  for  the  smaller  clique.  This  simplifies  the  state  transition  energy  evaluation  by  reduc¬ 
ing  the  number  of  terms  to  be  summed  up.  If  floating-point  arithmetic  is  costly,  this  can 
save  considerable  computational  effort.  The  preprocessing  needs  to  be  done  just  once,  and 
can  be  performed  off-line. 

6.4.  Experimental  Results 

The  simulator  described  above  has  been  used  for  a  series  of  experiments  aimed  at 
comparing  the  performances  of  various  relaxation  algorithms  with  respect  to  the  goodness 
of  final  estimations  and  rate  of  convergence.  We  focus  upon  algorithms  using  the  MAP  cri¬ 
terion,  including  Highest  Confidence  First  (HCF),  Deterministic  Iterative  Relaxation  (DIR) 
and  stochastic  MAP  (simulated  annealing  with  Gibbs  Sampler  [Geman  and  Geman  1984]). 
The  results  obtained  by  using  stochastic  MPM  (Monte  Carlo  approximation  to  the  MPM 
estimate  [Marroquin,  Mitter,  and  Poggio  1985])  are  also  presented  for  the  sake  of  complete¬ 
ness  of  comparisons,  as  are  those  obtained  by  applying  3X3  Kirsch  operators  with  non¬ 
maximum  suppression.  The  annealing  schedule  for  the  stochastic  MAP  follows  the  one  sug- 

gested  in  [Geman  and  Geman  1984],  i.e.  T*  =  ^  )  w^ere  *8  ^e  temperature  for 

the  ft*  iteration,  with  c  =  4.0.  The  stochastic  MAP  was  run  for  1000  iterations  and  the 
stochastic  MPM  for  500  (300  to  reach  equilibrium,  200  to  collect  statistics). 
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6.4.1.  Comparison  of  Estimates 

Here  we  show  the  results  of  three  sets  of  experiments  (Fig  s  7  through  9).  The  figures 
for  each  set  contain  the  original  image,  the  result  from  the  Kirsch  operators,  the  TLR 
configuration  and  the  results  obtained  by  using  stochastic  MAP,  stochastic  MPM,  DIR 
(scan-line  visiting  order),  DIR  (random  visiting  order)  and  HCF  algorithms.  Except  in  the 
case  of  the  HCF  algorithm,  where  the  MRF  is  initialized  to  all  null  (uncommitted)  states, 
the  MRF  is  initialized  to  the  TLR  configuration.  The  MRF  specification  is  the  same 
throughout. 

Fig.  7a  shows  a  synthetic  50  pixel  square  "checkerboard"  pattern.  Each  of  patches  is 
10  pixels  across,  with  an  intensity  chosen  randomly  from  between  0  and  255.  The  image 
has  been  degraded  by  independently  adding  to  each  pixel  Gaussian  noise  with  a  mean  of  0 
and  a  standard  deviation  of  16.  The  HCF  and  stochastic  MPM  (Figs.  7g  and  7d)  are  the 
same,  and  have  completed  most  of  the  desired  edges.  The  DIRs  (Figs.  7d  and  7e)  have 
incomplete  edges  and  the  stochastic  MAP  has  some  undesired  edges  and  incomplete  desired 
edges  (Fig.  7c).  The  Kirsch  operator  result  is  not  shown  as  the  edges  in  this  image  are 
always  located  exactly  in  between  pixels,  while  the  Kirsch  operator  assumes  edges  to  be  at 
pixel  locations,  and  so  a  comparison  would  be  unfair  to  the  Kirsch  operators. 

Fig.  8a  shows  a  50  pixel  square  natural  image  of  a  wooden  block  with  the  letter  "P" 
on  it.  The  MAP  estimate  has  several  undesirable  lines  (Fig.  8d).  The  MPM  estimate  per¬ 
forms  poorly  on  the  right  edge  of  the  block  and  the  inner  ring  of  the  ”P".  The  DIR  scheme 
(serial  scan)  (Fig.  8f)  performs  better  than  the  random  scan  version  (Fig.  8g),  but  is  less 
than  satisfactory  on  the  leg  of  the  "P"  and  the  right  edge  of  the  block.  The  HCF  estimate 
(Fig.  8h)  does  not  suffer  from  the  above  flaws,  producing  clean,  connected  edges. 

Fig.  9a  shows  a  100X124  natural  image  of  4  plastic  blocks  with  the  letters  "U",  "R", 
"C"  and  "S"  on  them.  Again,  the  HCF  algorithm  produces  superior  results  (Fig.  9g).  It  has 
the  clearest  letter  outlines  and  also  is  alone  in  detecting  the  entire  bottom  edge  of  the  "R" 
block.  The  MAP  estimate  partially  detects  the  bottom  edge  of  the  "R"  block,  but  generates 
redundant  lines  (Fig.  9c).  The  MPM  estimate  has  clear  letter  outlines  but  does  poorly  on 
the  outlines  of  the  left  blocks  (Fig.  9d).  The  DIR  scheme  (scan-line)  does  well  on  the  letter 
outlines  but  poorly  on  the  block  outlines  while  the  random  scan  version  does  poorly  on  both 
(Figs.  9e  and  9f). 

To  test  the  robustness  of  the  algorithms,  we  conduct  further  experiments  using  a  likel¬ 
ihood  generator  with  a  less  complete  edge  model.  Since  offset  edges  (Fig.  5c)  are  not  con¬ 
sidered  here,  multiple  responses  become  significant  as  can  be  seen  from  the  TLR 
configuration  shown  in  Fig.  10a.  This  change  strongly  affects  the  estimates  produced  by  all 
the  algorithms  except  the  HCF,  as  can  be  seen  from  comparing  corresponding  pictures  in 
Fig.  9  and  Fig.  10. 

6.4.2.  Rates  of  Convergence 

We  restrict  ourselves  to  comparisons  between  deterministic  schemes,  as  stochastic 
schemes  do  not  have  any  convergence  criterion  per  se  -  the  point  of  convergence  is  depen¬ 
dent  upon  our  judgement  as  to  when  equilibrium  has  been  reached,  and  as  to  when  we 
have  gathered  enough  statistics  to  estimate  the  joint  (or  marginal)  probabilities  accurately 
(typically  several  hundred  iterations  are  needed).  The  deterministic  algorithms  (HCF  and 


DIR  (scan-line))  have  been  timed  on  images  of  various  sizes  using  a  Sun  3/260  with  floating 

point  acceleration.  The  results  are  shown  in  Table  1. 

6.5.  Analysis  of  Experimental  Results 

Goodness  of  Estimates 

(1)  The  HCF  algorithm  repeatedly  outperforms  all  other  algorithms,  giving  superior 
results  both  with  synthetic  and  real  image  data.  The  common  characteristics  of  the 
results  we  have  obtained  from  using  this  algorithm  are  that  they  all  fit  well  in  our 
model  of  the  world,  which  consists  of  smoothly  continuous  boundaries,  and  that  they 
are  consistent  with  the  observations. 

(2)  The  HCF  algorithm  also  appears  to  be  robust,  in  that  it  produces  an  estimate  con¬ 
sistent  with  the  observations  even  when  the  MRF  model  used  is  inadequate,  as  in  the 
experiment  using  the  less  sophisticated  edge  detector.  Since  our  MRF  model  does  not 
take  into  account  multiple  responses,  the  MAP  criterion  may  not  lead  to  the  ’1)651" 
results.  In  this  case,  the  local  minimum  found  by  the  HCF  algorithm  is  actually  better 
than  the  global  one  as  it  is  based  on  the  strength  of  external  evidence. 

(3)  The  DIR  algorithm  performs  inconsistently  and  its  results  depend  to  a  large  extent 
upon  the  initialization  of  the  MRF  and  the  visiting  order.  It  is  also  not  clear  which,  if 
any,  of  the  visiting  orders  studied  is  better  than  the  other. 

(4)  The  stochastic  MAP  algorithm  with  simulated  annealing  gets  stuck  in  undesirable 
local  minima,  suggesting  that  our  annealing  schedule  might  have  lowered  the  tem¬ 
perature  too  fast.  However,  an  appropriate  annealing  schedule  seems  hard  to  obtain  a 
priori. 

Convergence  Times 

(1)  The  HCF  algorithm  makes  a  perhaps  surprisingly  small  number  of  visits  before  con¬ 
verging.  Clearly,  due  to  the  initialization,  it  must  visit  every  node  at  least  once  What 
is  surprising  is  that  it  visits  each  node  on  the  average  less  than  101  times  before  con¬ 
verging.  What  this  implies  is  that  the  first  decision  made  by  a  node  is  nearly  always 
the  best  one. 

(2)  The  convergence  times  of  the  DIR  algorithms  are  unpredictable  -  they  vary  with  visit¬ 
ing  order,  MRF  initialization  and  even  upon  the  particular  image  given  as  input.  The 
HCF  algorithm,  in  contrast,  takes  almost  the  same  time  on  different  images  of  the 
same  size.  The  time  taken  by  the  HCF  algorithm  includes  the  time  taken  to  set  up  the 
heap  initially.  This  may,  in  some  circumstances,  be  a  little  unfair.  For  instance,  if  one 
has  to  process  data  online  from  various  information  sources  [Chou  and  Brown  1987a] 
[Poggio  1985],  the  heap  setting  up  cost  can  be  treated  as  a  preprocessing  cost  rather 
than  a  run-time  one. 

(3)  In  theory,  the  time  taken  by  the  HCF  algorithm  should  be  given  by  CjAf +c2Vlog2N, 
where  c(  and  c2  are  positive  constants,  N  the  number  of  entities  to  be  labeled  and  V 
the  number  of  visits.  V  here  iB  at  least  N  and  we  conjecture  that  on  the  average  it  is 
cN  for  some  small  (l<c<2)  constant  c.  Since  the  latter  term  should  dominate,  one 
would  expect  to  see  a  nonlinear  curve  in  a  plot  of  run  time  vs.  number  of  entities. 
However,  the  curve  is  very  nearly  a  straight  line,  which  indicates  either  that  the 


constant  c2  is  very  small,  or  that  the  changed  stability  values  do  not  propagate  very 
far  up  the  heap  on  the  average.  The  former  does  not  appear  to  be  true,  as  our  experi¬ 
ences  suggest  that  the  initial  heap  construction  takes  far  less  time  than  the  rest  of  the 

algorithm. 

7.  Conclusions  and  Future  Research 

We  have  described  a  new  approach  for  solving  the  labeling  problem.  The  Highest 
Confidence  First  algorithm,  aimed  at  approximating  the  MAP  estimate  with  a  prion 
knowledge  modeled  by  MRF’s  and  external  observations  represented  as  likelihoods,  leads  to 
outstanding  results  in  our  experiments  with  both  synthetic  and  natural  images.  Not  only 
is  this  algorithm  much  faster  than  stochastic  estimation  procedures,  it  also  converges 
predictably.  In  addition,  the  algorithm  is  robust  -  in  the  case  that  the  prior  model  proves 
inadequate,  it  produces  an  estimate  that  is  highly  consistent  with  the  observations. 

We  are  incorporating  the  Highest  Confidence  First  algorithm  in  a  multi-modal  seg- 
menter  described  in  [Chou  and  Brown  1987a]  and  believe  it  to  be  well  suited  to  a  scenario 
where  the  result  is  to  be  computed  incrementally  from  sparse  and  dynamically-arriving 
data,  possibly  from  multiple  sources. 

We  are  studying  methods  for  systematically  specifying  the  clique  potential  functions 
of  MRF’s  from  given  realizations.  We  are  also  analyzing  the  rapid  convergence  of  the  HCF 
algorithm  observed  in  our  experiments  from  a  theoretical  viewpoint. 

The  concept  of  a  confidence-based  heuristic  is  likely  to  be  useful  whenever  there  is  a 
set  of  cooperating  processes  attempting  to  reach  a  consensus.  The  idea  that  the  processes 
with  a  greater  degree  of  certainity  about  their  decision,  get  to  make  it  first,  is  intuitively 
appealing.  We  are  investigating  applications  of  this  idea  to  other  fields. 
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Figure  3 

Figure  I:  Relationship  between  MRF  entities  and  pixels.  Figure  2:  The  second- 
order  neighborhood  system.  Figure  3:  Cliques  in  neighborhood  system,  of  size 
greater  than  one.  (a)-(c):  size  2;  (d)-(e):  size  3;  (f)-(g):  size  4 
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Figure  4:  Potential  Assignments  for  Cliques 
(Configurations  not  shown  have  null  (0.0) 
potential  values) 
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Figure  8:  Experiment  set.  (a)  Natural  50X50  image  of  wooden  block,  (b)  Thinned  and  thres- 
holded  output  of  3X3  Kirsch  operators,  (c)  TLR  configuration,  (d)  Stochastic  MAP  estimate,  (e) 
Stochastic  MPM  estimate,  (f)  DIR  (scan-line  visiting  order)  MAP  estimate,  (g)  DIR  (random  visit¬ 
ing  order)  MAP  estimate,  (h)  HCF  result. 
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Figure  10:  Experiments  with  incomplete  edge  model  -  original  image  in  Fig.  da.  (a)  TLR 
configuration,  (b)  Stochastic  MAP  estimate,  (c)  Stochastic  MPM  estimate,  (d)  DIR  (scan-line  visit¬ 
ing  order)  MAP  estimate,  (e)  DIR  (random  visiting  order)  MAP  estimate,  (f)  HCF  result. 
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